International  Journal  of  Applied  Mathematics 
& Statistical  Sciences  (IJAMSS) 

ISSN(P):  2319-3972;  ISSN(E):  2319-3980 
Vol.  6,  Issue  4,  Jun  - Jul  2017;  19-36 
© IASET 


Connecting  Researchers;  Nurturing  Innovations 


International  Academy  of  Science, 
Engineering  and  Technology 


GENERALIZED  GAMMA  REGRESSION  MODELS  WITH  APPLICATION  TO 
CD4  CELL  COUNTS  DATA  OF  AIDS  PATIENTS 

ADEKANMBI  D.B 

Research Scholar, Department of Statistics, Ladoke Akintola University of Technology,
Ogbomoso, Oyo State, Nigeria


ABSTRACT 


The gamma regression model is a sensible choice for analysing responses that are continuous, positively skewed and
strictly positive, with a constant coefficient of variation; the CD4 cell counts of AIDS patients are of this type.
CD4 counts may vary by level of formal education, gender, marital status and age of AIDS patients.
A detailed theoretical framework of gamma regression is given in this study and applied to a retrospective data set of
AIDS patients to determine the relationship between the risk factors and the CD4 counts of the patients. Three gamma
regression models were considered, with three different links for the mean: log, identity and inverse. The choice of
link function is critical to the accuracy of a gamma regression model. There appears to be a positive effect of sex,
level of education and marital status, and a negative effect of age, on the CD4 counts of the AIDS patients.
All three models showed a significant positive effect of sex on the CD4 counts of AIDS patients.
The difference between the log link and the identity link was minimal. The gamma regression model with the inverse link
fits poorly, while the gamma regression model with the identity link seems to provide a more precise fit to the AIDS
data and was therefore preferred. The results showed that older patients have lower CD4 cell counts than younger AIDS
patients, while males generally have higher CD4 counts than females. However, none of the three gamma regression models
fully captured the nature of the observed distribution of the CD4 cell counts. The models were evaluated by comparing
their deviances and their Akaike Information Criterion values. Diagnostic evaluation revealed no major problems with the
models, apart from a few non-influential outliers. Based on the visual and empirical evidence, the identity-link model
is preferred for modelling the AIDS data. The results of this study have the potential to be useful for health workers
attempting to determine factors associated with improved health of AIDS patients, and for policy makers who are
interested in the costs and outcomes associated with the treatment of AIDS patients.

KEYWORDS:  Acquired  Immune  Deficiency  Syndrome,  CD4  Cell  Count,  Gamma  Regression,  Link  Functions,  Akaike 
Information  Criterion  (AIC) 

1.0  INTRODUCTION 

AIDS is an acronym for Acquired Immune Deficiency Syndrome, a disease of the immune system that is caused by the
Human Immunodeficiency Virus (HIV). AIDS is the final stage of HIV infection, during which infections overwhelm the
weakened defences of the body, [6, 12, 29]. HIV can be transmitted from an infected person to an uninfected
person by the direct transfer of bodily fluids such as blood products, breast milk, semen and other genital secretions.
The primary route of transmission is sexual intercourse, either homosexual or heterosexual, [3, 12, 20, 31].
Presently, there is no vaccine or cure for AIDS. However, antiretroviral treatment is believed to reduce the risk of
infection, although it neither cures the patient nor alleviates the symptoms, [16, 39]. It was estimated that over
60 million adults were infected in the year 2000, with 63% of them in sub-Saharan Africa, [3, 37, 38]. In Nigeria, the
prevalence rate rose from 1.8% in 1991 to 3.8% in 1993, 4.5% in 1995, 5.4% in 1999 and 5.8% in 2001, and then declined
slightly to 5% in 2003 and 4.4% in 2005, [12, 29]. The highest percentage of reported AIDS diagnoses occurs in the
20-40 age group, which has reduced the workforce and productivity, with an estimated 4.5-year decrease in the life
expectancy of this age group due to the disease, [20, 38]. Women represent the fastest growing group with HIV
infection, with the highest rate in the sub-Saharan region, [19, 34, 38]. At the end of 2010, an estimated 34
million people were living with HIV globally, including 3.4 million children under 15 years of age, [12].

CD4 cells, also called T-lymphocytes, are white blood cells that protect against viral infections by producing
antibodies, and are the body's natural defence system against pathogens, infections and illnesses, [5, 13, 17]. Once a
person is infected with HIV, the virus attacks and destroys the CD4 cells of the person's immune system, which causes
the number of cells to decrease over time, [13, 32]. A CD4 cell count is the number of CD4 cells in a cubic
millimetre of blood and is the most important laboratory indicator of the health of a person's immune system,
[5, 17, 18, 30]. A higher CD4 count indicates a stronger immune system to fight HIV and other infections, [42].
It is therefore reasonable to monitor trends in the CD4 count of an AIDS patient over time. The CD4 count of an
uninfected adult in good health ranges from 500 cells/mm³ to 1500 cells/mm³. HIV-positive people with a CD4 count
over 500 cells/mm³ are regarded as being in good health, while those with a CD4 count below 200 cells/mm³ are
at significant risk of developing serious illnesses and infections, [5, 17, 18, 29, 30, 40]. Some studies reported sex
differences in the CD4 cell counts of AIDS patients, indicating that CD4 counts are lower in women than in men,
[13, 16, 19, 21, 32, 35, 36].

The gamma distribution is suitable as a lifetime model, [14, 24]. It can be viewed as a generalization of the
exponential distribution with mean $1/\lambda$, which represents the waiting time until the first event occurs,
where the events are generated by a Poisson process with rate $\lambda$. The gamma random variable therefore
represents the waiting time until the nth event occurs, [24, 41]. The gamma distribution rests on the assumption that
all waiting times are complete at the end of the study, so that censoring is not allowed. The gamma regression model is
applicable if the response has a gamma distribution, [11]. When the distribution of the response is positively skewed
and its variance increases with the mean, the gamma distribution is appropriate, [28]. The assumption of a constant
coefficient of variation (CV) in gamma regression can be verified by grouping the data into intervals based on the
value of the estimated mean and estimating the CV in each interval. A plot of the CV against the mean should reveal any
systematic departure from constancy, [28]. The model has a wide range of applications in the medical field, [23, 24].
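
The grouping check for a constant coefficient of variation described above can be sketched as follows. This is a minimal illustration using only the Python standard library; the function name `cv_by_mean_group`, the number of groups and the simulated data are assumptions made for illustration and are not taken from the study.

```python
import random
import statistics

def cv_by_mean_group(fitted_means, responses, n_groups=4):
    """Sort observations by fitted mean, split them into n_groups groups of
    equal size, and return a (group mean, group CV) pair for each group."""
    pairs = sorted(zip(fitted_means, responses))
    size = len(pairs) // n_groups
    summary = []
    for g in range(n_groups):
        # The last group absorbs any remainder.
        end = (g + 1) * size if g < n_groups - 1 else len(pairs)
        ys = [y for _, y in pairs[g * size:end]]
        mean_y = statistics.mean(ys)
        summary.append((mean_y, statistics.stdev(ys) / mean_y))
    return summary

# Simulated data (hypothetical, not the AIDS data): gamma responses with a
# common shape of 4, so the theoretical CV is 1 / sqrt(4) = 0.5 at every mean.
rng = random.Random(1)
shape = 4.0
fitted = [mu for mu in (100.0, 200.0, 400.0, 800.0) for _ in range(1000)]
# random.gammavariate takes (shape, scale); scale = mu / shape gives mean mu.
observed = [rng.gammavariate(shape, mu / shape) for mu in fitted]

for mean_y, cv in cv_by_mean_group(fitted, observed):
    print(f"group mean {mean_y:8.1f}   CV {cv:.3f}")
```

If the gamma assumption is reasonable, the printed CV values should be roughly equal across groups; a clear trend in CV with the group mean would point to a departure from constancy.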

This study focuses on gamma regression modelling of AIDS cases in Nigeria, to determine the factors that are
significant in improving the CD4 cell counts of AIDS patients, and to determine the most suitable link function for the
data set among the existing link functions for gamma regression models. The remainder of the study is organized as
follows. Section 2 gives a detailed description of the data used in the study. Section 3 presents the theory of the
gamma distribution and gamma regression. A measure for model selection is summarized in section 4. The application of
gamma regression to the AIDS data is discussed in section 5, where the results based on the three different link
functions are also presented and discussed. Finally, section 6 discusses various issues arising from the study.

2.0 DATA

Data on AIDS patients used in this study were extracted from the records of the Obafemi Awolowo Teaching
Hospitals, Nigeria. The data set contains 407 AIDS cases, and the demographic and clinical variables recorded for
each patient were gender, level of formal education, marital status, age at death and CD4 cell count.
The response is therefore the CD4 cell count of each patient. By its nature the response is continuous, positive and
positively skewed, and does not exhibit the same variability at all levels.

3.0  METHODOLOGY 

Generalised linear models (GLMs) are an extension of classical linear regression, usually formulated to predict the
outcome of a response as a function of a linear combination of a set of predictors or explanatory
variables, [10, 11, 25, 26, 28]. Formulating a GLM requires a link function, which relates the linear predictor
to the mean of the response, and a probability distribution for the response around that mean, chosen from the
exponential family. Examples of distributions that belong to the exponential family are the Normal, Binomial, Poisson
and Gamma distributions.

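Given predictors $x_i$, the mean of the response variable can be expressed in terms of a linear combination of the
predictors, such that:

$$ \eta_i = x_i'\beta \tag{1} $$

$$ \eta_i = g(\mu_i) = x_i'\beta = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} \tag{2} $$

Where

$\eta_i$: is the linear predictor.
$g(\cdot)$: is the link function.
$\mu_i = E(y_i \mid x_i)$: is the mean of the response.

The link function is invertible, so that

$$ \mu_i = g^{-1}(\eta_i) = g^{-1}(x_i'\beta) = g^{-1}(\beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}) $$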

The three common link functions for gamma regression are, [10, 11, 28]:

(i) The log link: $g(\mu_i) = \log(\mu_i)$

(ii) The identity link: $g(\mu_i) = \mu_i$

(iii) The inverse link: $g(\mu_i) = 1/\mu_i$ (3)

The log link should be used when the predictors are suspected to act on the mean in multiples, so that the effect of
the predictors $x_i$ on $\mu$ is multiplicative rather than additive. The log link is considered here because the
response variable, the CD4 count, must be positive and the log link yields positive fitted means. The identity link is
adequate for modelling variance components which have a chi-square distribution. Gamma regression with the identity
link is also adequate when the effect of each predictor is considered additive on the original scale.
Since $-\infty < \eta < \infty$, the inverse link does not guarantee $\mu > 0$, and this could cause problems which
might require restrictions on $\beta$, [11]. Essentially, the choice of link function is rather subjective, [11].
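
The behaviour of the three links can be illustrated with a short sketch. The dictionary `LINKS`, the helper names and the predictor values below are assumptions chosen for illustration; each link is stored together with its inverse, which maps a linear predictor back to a mean.

```python
import math

# Each link g is stored with its inverse g^{-1}, which maps the linear
# predictor eta back to the mean mu.
LINKS = {
    "log": (math.log, math.exp),
    "identity": (lambda mu: mu, lambda eta: eta),
    "inverse": (lambda mu: 1.0 / mu, lambda eta: 1.0 / eta),
}

def predictor_from_mean(link, mu):
    """Apply the link: eta = g(mu)."""
    return LINKS[link][0](mu)

def mean_from_predictor(link, eta):
    """Apply the inverse link: mu = g^{-1}(eta)."""
    return LINKS[link][1](eta)

# Hypothetical linear predictor values, one of them negative.
for eta in (0.004, -0.5):
    for link in LINKS:
        mu = mean_from_predictor(link, eta)
        print(f"{link:8s} eta={eta:6.3f}  mu={mu:10.4f}  positive={mu > 0}")
```

In this sketch only the log link returns a positive mean for every value of the linear predictor; a negative predictor gives a negative mean under both the inverse and the identity link, which is why restrictions on the coefficients may be needed for those two links.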

3.1  Re-Parameterization  of  Gamma  Distribution 

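The two-parameter gamma distribution has a density function of the form:

$$ f(y; \lambda, \alpha) = \frac{\lambda}{\Gamma(\alpha)} (\lambda y)^{\alpha - 1} \exp(-\lambda y), \quad y > 0 \text{ and } \lambda, \alpha > 0 \tag{4} $$

Where

$\alpha$: shape parameter
$\lambda$: rate (inverse scale) parameter of the distribution.

Then $Y \sim G(\alpha, \lambda)$

The properties of a gamma distribution are therefore

$$ E(Y_i) = \frac{\alpha}{\lambda} = \mu_i \tag{5} $$

$$ \operatorname{Var}(Y_i) = \frac{\alpha}{\lambda^2} = \frac{1}{\alpha}\left(\frac{\alpha}{\lambda}\right)^2 = \frac{1}{\alpha}\,(E(Y_i))^2 \tag{6} $$

According to [7, 8], the gamma density function in (4) can be re-parameterized in terms of the mean and shape
parameters by setting $\mu = \alpha/\lambda$, so that $\lambda = \alpha/\mu$ and (4) becomes

$$ f_Y(y \mid \mu, \alpha) = \frac{1}{\Gamma(\alpha)} \left(\frac{\alpha}{\mu}\right)^{\alpha} y^{\alpha - 1} \exp\left(-\frac{\alpha y}{\mu}\right) \tag{7} $$

$$ f_Y(y \mid \mu, \alpha)\,dy = \frac{1}{\Gamma(\alpha)} \left(\frac{\alpha y}{\mu}\right)^{\alpha} \exp\left(-\frac{\alpha y}{\mu}\right) d(\ln y) \tag{8} $$

since $d(\ln y) = dy/y$.

The density in (8) is referred to as $G(\mu, \alpha)$, which implies that $y$ follows a gamma distribution with mean
$\mu$ and shape parameter $\alpha$.

3.2 Gamma Distribution as a Member of the Exponential Family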

Let the gamma density have the form (7), so that $Y \sim G(\mu, \alpha)$. Rearranging the density
according to the exponential form gives

$$ f(y) = \exp\left\{ \frac{-\frac{y}{\mu} - \log(\mu)}{1/\alpha} + \alpha\log\alpha + (\alpha - 1)\log y - \log\Gamma(\alpha) \right\} \tag{9} $$

So that

$$ \theta = -\frac{1}{\mu}, \quad a(\varphi) = \varphi, \quad \varphi = \frac{1}{\alpha} $$

And

$$ b(\theta) = \log(\mu) = \log(-1/\theta) = -\log(-\theta) $$

Then

$$ b'(\theta) = -\frac{1}{\theta} = \mu $$

$$ b''(\theta) = \frac{1}{\theta^2} = \mu^2 $$

The canonical parameter is $\theta = -1/\mu$, so that the canonical link function is therefore the inverse link,
$g(\mu) = 1/\mu$, up to sign. The dispersion parameter is $\varphi = 1/\alpha$. For any gamma distribution,
$\mu > 0$, since the distribution is only defined for $y > 0$, [24, 26].
Restrictions must therefore be placed on the vector $\beta$ to ensure that the expected value is positive.
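
As a numerical cross-check of the re-parameterization and of the exponential-family form, the sketch below evaluates the density in its three written forms, (4), (7) and (9), at one point and compares $b'(\theta)$ with $\mu$ by a central difference. The function names and the numerical values are hypothetical and chosen only for illustration.

```python
import math

def pdf_rate(y, alpha, lam):
    """Density (4): shape alpha and parameter lam, with mean alpha / lam."""
    return lam / math.gamma(alpha) * (lam * y) ** (alpha - 1) * math.exp(-lam * y)

def pdf_mean(y, mu, alpha):
    """Density (7): mean mu and shape alpha."""
    return ((alpha / mu) ** alpha * y ** (alpha - 1)
            * math.exp(-alpha * y / mu) / math.gamma(alpha))

def pdf_expfam(y, mu, alpha):
    """Density (9) with theta = -1/mu, b(theta) = -log(-theta), phi = 1/alpha."""
    theta, phi = -1.0 / mu, 1.0 / alpha
    b = -math.log(-theta)
    c = alpha * math.log(alpha) + (alpha - 1) * math.log(y) - math.lgamma(alpha)
    return math.exp((y * theta - b) / phi + c)

def b_prime(theta, h=1e-6):
    """Central-difference derivative of b(theta) = -log(-theta)."""
    return (-math.log(-(theta + h)) + math.log(-(theta - h))) / (2 * h)

# Hypothetical values chosen only for illustration.
y, mu, alpha = 350.0, 300.0, 4.0
print(pdf_rate(y, alpha, alpha / mu), pdf_mean(y, mu, alpha), pdf_expfam(y, mu, alpha))
print(b_prime(-1.0 / mu), mu)
```

The three density values should agree to rounding error, and the numerical derivative of $b(\theta)$ at $\theta = -1/\mu$ should reproduce the mean.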

3.3  Model  Formulation  of  Gamma  Regression 

Let $y_i \sim G(\mu_i, \alpha)$, $i = 1, 2, \ldots, n$, be independent random variables; then the gamma regression
model is given as:

$$ \eta_i = g(\mu_i) = x_i'\beta \tag{10} $$

Where

$x_i = (x_{i1}, \ldots, x_{ik})'$: is the vector of k covariates. Usually $x_{i1} = 1$ for all i, so that the model has
an intercept.

$\beta = (\beta_1, \ldots, \beta_k)'$: is the vector of unknown regression parameters, (k < n).

$\eta_i$: is the linear predictor.

$g$: is the link function.

[7] proposed that the shape parameter need not be constant across observations and can itself be modelled by a
regression structure, so that $y_i \sim G(\mu_i, \alpha_i)$, $i = 1, 2, \ldots, n$, are independent random variables
with gamma distribution. The mean and shape parameters follow a regression structure given as:

$$ \eta_{1i} = g(\mu_i) = x_i'\beta \tag{11} $$

$$ \eta_{2i} = h(\alpha_i) = z_i'\gamma \tag{12} $$

Where

$\beta = (\beta_1, \ldots, \beta_k)'$ and $\gamma = (\gamma_1, \ldots, \gamma_j)'$, with $k + j < n$, are vectors of
regression parameters which are related to the mean and dispersion.

$g$: is the mean link function.

$h$: is the shape link function.

$\eta_{1i}$ and $\eta_{2i}$: are the linear predictors.

$x_i$ and $z_i$: are the mean and shape explanatory variables for the ith observation.

Gamma regression has been found adequate in modelling data in which the coefficient of variation is constant,
[9, 26, 28]:

$$ \frac{\sqrt{\operatorname{Var}(y_i)}}{E(y_i)} = \frac{\sqrt{\alpha/\lambda_i^2}}{\alpha/\lambda_i} = \frac{1}{\sqrt{\alpha}} \tag{13} $$
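
To make the model formulation concrete, the following sketch simulates responses from a gamma regression with a log link and a common shape parameter, and then compares the sample coefficient of variation with $1/\sqrt{\alpha}$. The design, the coefficient values and the function name `simulate_gamma_regression` are hypothetical and are not estimates from the AIDS data.

```python
import math
import random
import statistics

def simulate_gamma_regression(X, beta, alpha, rng):
    """Draw y_i ~ G(mu_i, alpha) with log link, mu_i = exp(x_i' beta)."""
    ys = []
    for xi in X:
        mu = math.exp(sum(b * x for b, x in zip(beta, xi)))
        # random.gammavariate takes (shape, scale); scale = mu / alpha gives mean mu.
        ys.append(rng.gammavariate(alpha, mu / alpha))
    return ys

# Hypothetical design: an intercept plus one binary covariate.
rng = random.Random(7)
X = [(1.0, 0.0)] * 5000 + [(1.0, 1.0)] * 5000
beta, alpha = (5.7, 0.07), 4.0
y = simulate_gamma_regression(X, beta, alpha, rng)

for label, ys in (("x = 0", y[:5000]), ("x = 1", y[5000:])):
    m = statistics.mean(ys)
    print(label, "mean", round(m, 1), "CV", round(statistics.stdev(ys) / m, 3))
print("theoretical CV", 1 / math.sqrt(alpha))
```

Both groups have different means but, up to sampling error, the same coefficient of variation, which is the property that motivates the gamma model.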


3.4  Parameter  Estimation  of  Gamma  Regression  Model 

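Generalised linear models can be fitted to data by the method of maximum likelihood, which provides not only
estimates of the regression coefficients but also estimated asymptotic standard errors of the coefficients, [10, 28].
Given $Y_i \sim G(\mu_i, \alpha_i)$, under the re-parameterisation of the gamma density function given in (7), the
likelihood function of the gamma regression models of (11) and (12) is given by:

$$ L = \prod_{i=1}^{n} \frac{1}{\Gamma(\alpha_i)} \left(\frac{\alpha_i}{\mu_i}\right)^{\alpha_i} y_i^{\alpha_i - 1} \exp\left(-\frac{\alpha_i y_i}{\mu_i}\right) \tag{14} $$

The log-likelihood function is therefore

$$ l = \sum_{i=1}^{n} \left\{ -\log[\Gamma(\alpha_i)] + \alpha_i \log\left(\frac{\alpha_i y_i}{\mu_i}\right) - \log(y_i) - \frac{\alpha_i y_i}{\mu_i} \right\} \tag{15} $$

Assuming the systematic components $\mu_i = x_i'\beta$ and $\alpha_i = \exp(z_i'\gamma)$, the components of the score
function are therefore:

$$ \frac{\partial l}{\partial \beta_j} = -\sum_{i=1}^{n} \frac{\alpha_i}{\mu_i} \left(1 - \frac{y_i}{\mu_i}\right) x_{ij}, \quad j = 1, 2, \ldots, p \tag{16} $$

$$ \frac{\partial l}{\partial \gamma_k} = -\sum_{i=1}^{n} \alpha_i \left[ \frac{d}{d\alpha_i}\log\Gamma(\alpha_i) - \log\left(\frac{\alpha_i y_i}{\mu_i}\right) - 1 + \frac{y_i}{\mu_i} \right] z_{ik}, \quad k = 1, 2, \ldots, r \tag{17} $$

These derivatives can then be assembled to give a vector of efficient scores, $U(\beta)$:

$$ U(\beta) = \left( \frac{\partial l}{\partial \beta_1}, \ldots, \frac{\partial l}{\partial \beta_p} \right)^{T} \tag{18} $$

$$ U(\beta) = X' W^{(k)} Y^{*} $$

Where

$X$: is the design matrix.

$W^{(k)}$: a diagonal matrix with elements $w_i$.

$Y^{*}$: a vector with elements $y_i^{*}$.

Comparing with (16), this corresponds to $w_i = \alpha_i/\mu_i^2$ and $y_i^{*} = y_i - \mu_i$.

The components of the Hessian matrix are:

$$ \frac{\partial^2 l}{\partial \beta_k \partial \beta_j} = \sum_{i=1}^{n} \frac{\alpha_i}{\mu_i^2} \left(1 - \frac{2 y_i}{\mu_i}\right) x_{ij} x_{ik}, \quad j, k = 1, 2, \ldots, p \tag{19} $$

$$ \frac{\partial^2 l}{\partial \gamma_k \partial \beta_j} = -\sum_{i=1}^{n} \frac{\alpha_i}{\mu_i} \left(1 - \frac{y_i}{\mu_i}\right) x_{ij} z_{ik}, \quad k = 1, 2, \ldots, r \text{ and } j = 1, 2, \ldots, p \tag{20} $$

$$ \frac{\partial^2 l}{\partial \gamma_k \partial \gamma_j} = -\sum_{i=1}^{n} \alpha_i \left[ \frac{d}{d\alpha_i}\log\Gamma(\alpha_i) - \log\left(\frac{\alpha_i y_i}{\mu_i}\right) - 1 + \frac{y_i}{\mu_i} \right] z_{ij} z_{ik} - \sum_{i=1}^{n} \alpha_i^2 \left[ \frac{d^2}{d\alpha_i^2}\log\Gamma(\alpha_i) - \frac{1}{\alpha_i} \right] z_{ij} z_{ik}, \quad j, k = 1, 2, \ldots, r \tag{21} $$

Taking expectations on both sides of equations (19), (20) and (21) yields

$$ -E\left(\frac{\partial^2 l}{\partial \beta_k \partial \beta_j}\right) = \sum_{i=1}^{n} \frac{\alpha_i}{\mu_i^2} x_{ij} x_{ik}, \quad j, k = 1, 2, \ldots, p \tag{22} $$

$$ -E\left(\frac{\partial^2 l}{\partial \gamma_k \partial \beta_j}\right) = 0, \quad k = 1, 2, \ldots, r \text{ and } j = 1, 2, \ldots, p \tag{23} $$

$$ -E\left(\frac{\partial^2 l}{\partial \gamma_k \partial \gamma_j}\right) = \sum_{i=1}^{n} \alpha_i^2 \left[ \frac{d^2}{d\alpha_i^2}\log\Gamma(\alpha_i) - \frac{1}{\alpha_i} \right] z_{ij} z_{ik}, \quad j, k = 1, 2, \ldots, r \tag{24} $$

The Fisher information matrix is a block diagonal matrix, with one of the blocks corresponding to the mean
regression parameters, while the other corresponds to the shape regression parameters. It follows that the Fisher
information is

$$ I(\beta, \gamma) = \begin{pmatrix} -E\left(\dfrac{\partial^2 l}{\partial \beta_k \partial \beta_j}\right) & -E\left(\dfrac{\partial^2 l}{\partial \gamma_k \partial \beta_j}\right) \\ -E\left(\dfrac{\partial^2 l}{\partial \beta_k \partial \gamma_j}\right) & -E\left(\dfrac{\partial^2 l}{\partial \gamma_k \partial \gamma_j}\right) \end{pmatrix} \tag{25} $$

$$ = \begin{pmatrix} \sum_{i=1}^{n} \dfrac{\alpha_i}{\mu_i^2} x_{ij} x_{ik} & 0 \\ 0 & \sum_{i=1}^{n} \alpha_i^2 \left[ \dfrac{d^2}{d\alpha_i^2}\log\Gamma(\alpha_i) - \dfrac{1}{\alpha_i} \right] z_{ij} z_{ik} \end{pmatrix} \tag{26} $$

In matrix form, the block for the mean parameters is

$$ I(\beta) = X' W^{(k)} X \tag{27} $$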
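
The score components for the mean parameters can be checked against a numerical derivative of the log-likelihood. The sketch below does this under simplifying assumptions that are not part of the study: an identity mean link $\mu_i = x_i'\beta$, a common known shape $\alpha$, and a small set of made-up observations. The function names are hypothetical.

```python
import math

def loglik(beta, alpha, X, y):
    """Log-likelihood (15) with mu_i = x_i' beta and a common shape alpha."""
    total = 0.0
    for xi, yi in zip(X, y):
        mu = sum(b * x for b, x in zip(beta, xi))
        total += (-math.lgamma(alpha) + alpha * math.log(alpha * yi / mu)
                  - math.log(yi) - alpha * yi / mu)
    return total

def score_beta(beta, alpha, X, y):
    """Score components (16) for the mean parameters."""
    score = [0.0] * len(beta)
    for xi, yi in zip(X, y):
        mu = sum(b * x for b, x in zip(beta, xi))
        for j, xij in enumerate(xi):
            score[j] += -(alpha / mu) * (1.0 - yi / mu) * xij
    return score

def numerical_score(beta, alpha, X, y, h=1e-4):
    """Central-difference approximation to the gradient of loglik in beta."""
    grads = []
    for j in range(len(beta)):
        up, down = list(beta), list(beta)
        up[j] += h
        down[j] -= h
        grads.append((loglik(up, alpha, X, y) - loglik(down, alpha, X, y)) / (2 * h))
    return grads

# Hypothetical toy data: intercept plus one binary covariate.
X = [(1.0, 0.0), (1.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
y = [280.0, 350.0, 310.0, 300.0]
beta, alpha = [300.0, 20.0], 4.0
print(score_beta(beta, alpha, X, y))
print(numerical_score(beta, alpha, X, y))
```

Agreement between the two printed vectors is a quick sanity check that the analytic score matches the log-likelihood it was derived from.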


An iterative algorithm to obtain the maximum likelihood estimates of the gamma regression parameters has been
proposed, [8, 28]. Given the parameter values $\gamma^{(k)}$, the vector of mean regression parameters is updated from:

$$ \beta^{(k+1)} = \beta^{(k)} + I^{-1}(\beta^{(k)})\, U(\beta^{(k)}) \tag{28} $$

$$ \beta^{(k+1)} = \beta^{(k)} + (X' W^{(k)} X)^{-1} X' W^{(k)} Y^{*} \tag{29} $$

Where

$W^{(k)}$: is a diagonal matrix with elements $w_i^{(k)} = \alpha_i^{(k)} / (\mu_i^{(k)})^2$.

Given the values $(\beta^{(k+1)}, \gamma^{(k)})$, the shape parameters $\gamma^{(k+1)}$ can be updated from

$$ \gamma^{(k+1)} = (Z' W^{(k)} Z)^{-1} Z' W^{(k)} Y \tag{30} $$

Where

$W^{(k)}$: is a diagonal matrix with elements $w_i^{(k)}$.
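
The update for the mean parameters can be sketched in code. The example below is a simplified illustration rather than the algorithm of [8, 28] in full: it assumes an identity mean link, a common fixed shape $\alpha$ (so the shape update is omitted), weights $w_i = \alpha/\mu_i^2$ and working vector $y_i^{*} = y_i - \mu_i$. The helper `solve`, the function `fisher_scoring` and the data are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        for r in range(n):
            if r != c:
                factor = M[r][c] / M[c][c]
                M[r] = [a - factor * pc for a, pc in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def fisher_scoring(X, y, alpha, beta, n_iter=25):
    """Iterate beta <- beta + (X'WX)^{-1} X'W y*, with identity mean link."""
    p = len(beta)
    for _ in range(n_iter):
        mu = [sum(b * x for b, x in zip(beta, xi)) for xi in X]
        w = [alpha / m ** 2 for m in mu]
        info = [[sum(wi * xi[j] * xi[k] for wi, xi in zip(w, X))
                 for k in range(p)] for j in range(p)]
        score = [sum(wi * (yi - mi) * xi[j] for wi, yi, mi, xi in zip(w, y, mu, X))
                 for j in range(p)]
        step = solve(info, score)
        beta = [b + s for b, s in zip(beta, step)]
    return beta

# Hypothetical data: two groups with sample means 295 and 330.
X = [(1.0, 0.0)] * 3 + [(1.0, 1.0)] * 3
y = [280.0, 310.0, 295.0, 350.0, 300.0, 340.0]
start = [sum(y) / len(y), 0.0]
print(fisher_scoring(X, y, alpha=4.0, beta=start))
```

With an intercept and a single group indicator, the fitted means under the identity link equal the two group means, so the iteration should settle at an intercept of 295 and a group effect of 35 for these data.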


3.5 Gamma Regression Residual Analysis

The  main  purpose  of  residual  analysis  in  a generalized  linear  model  is  to  identify  model  mis-specification  or 
outliers.  Since  a well  fitting  model  is  a prerequisite  for  reliable  inferences,  it  is  therefore  necessary  to  inspect  the  quality  of 
fit  provided  by  the  gamma  regression  model,  after  fitting  the  model  to  a set  of  data. 

3.5.1  Residual  Deviance 

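The lack of fit in gamma regression is measured by the deviance, which provides a measure of the discrepancy between
the model and the data. A large value of the deviance indicates a poor fit, while a small value indicates a good fit,
[25]. Given that

$$ Y_i \sim G(\mu_i, \alpha) \ \text{independent}, \quad \mu_i = \exp(x_i'\beta) \tag{31} $$

$$ l(\mu, \alpha, y) = \sum_{i=1}^{n} \left[ \alpha\left(-\frac{y_i}{\mu_i} - \log(\mu_i)\right) - \log(\Gamma(\alpha)) + \alpha\log(\alpha y_i) - \log(y_i) \right] \tag{32} $$

the log-likelihood of the saturated model, obtained by setting $\mu_i = y_i$ in (32), is

$$ l(y, \alpha, y) = \sum_{i=1}^{n} \left[ \alpha\left(-1 - \log(y_i)\right) - \log(\Gamma(\alpha)) + \alpha\log(\alpha y_i) - \log(y_i) \right] \tag{33} $$

and the deviance is

$$ D(y, \hat{\mu}) = 2\varphi \left[ l(y, \alpha, y) - l(\hat{\mu}, \alpha, y) \right] \tag{34} $$

$$ D(y, \hat{\mu}) = 2 \sum_{i=1}^{n} \left[ \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i} - \log\left(\frac{y_i}{\hat{\mu}_i}\right) \right] \tag{35} $$

Then the deviance residual is therefore

$$ r_{D,i} = \operatorname{sign}(y_i - \hat{\mu}_i) \sqrt{ 2\left[ \frac{y_i - \hat{\mu}_i}{\hat{\mu}_i} - \log\left(\frac{y_i}{\hat{\mu}_i}\right) \right] } \tag{36} $$

and

$$ D(y, \hat{\mu}) \sim \varphi \chi^2_{n-p} $$

Where $\varphi = 1/\alpha$ and $\hat{\mu}_i = g^{-1}(x_i'\hat{\beta})$.

Reject model (10) at the 0.05 significance level if

$$ \frac{D(y, \hat{\mu})}{\varphi} > \chi^2_{n-p,\,0.05} $$

So that a calculated deviance that exceeds the upper $100a$ percent point of the $\chi^2_{n-p}$ distribution indicates
a poor fit to the data at the $100a$ percent significance level, where $a$ denotes the significance level.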
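
The deviance-based check can be sketched as follows. The gamma deviance is computed directly; because the Python standard library has no chi-square quantile function, the critical value here uses the Wilson-Hilferty approximation, which is a substitute introduced for this illustration and not part of the study. The function names, the toy data and the shape value are hypothetical.

```python
import math
from statistics import NormalDist

def gamma_deviance(y, mu_hat):
    """Unscaled gamma deviance: 2 * sum((y - mu)/mu - log(y/mu))."""
    return 2.0 * sum((yi - mi) / mi - math.log(yi / mi) for yi, mi in zip(y, mu_hat))

def chi2_upper_point(df, level=0.05):
    """Approximate upper-`level` point of chi-square(df) (Wilson-Hilferty)."""
    z = NormalDist().inv_cdf(1.0 - level)
    c = 2.0 / (9.0 * df)
    return df * (1.0 - c + z * math.sqrt(c)) ** 3

def lack_of_fit(y, mu_hat, alpha, n_params, level=0.05):
    """Return (scaled deviance, critical value, reject?) with phi = 1/alpha."""
    scaled = alpha * gamma_deviance(y, mu_hat)
    critical = chi2_upper_point(len(y) - n_params, level)
    return scaled, critical, scaled > critical

# Hypothetical observations and fitted means.
y_obs = [280.0, 310.0, 295.0, 350.0, 300.0, 340.0]
mu_fit = [295.0, 295.0, 295.0, 330.0, 330.0, 330.0]
print(lack_of_fit(y_obs, mu_fit, alpha=4.0, n_params=2))
print(chi2_upper_point(197))  # degrees of freedom as in Table 2
```

A scaled deviance below the critical value gives no evidence of lack of fit at the chosen level.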

3.5.2  Standardised  Residual 

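The standardized residuals identify any observations that make a disproportionately large contribution to the
deviance. For gamma regression, the standardized residual is defined as follows:

$$ r_i = \frac{y_i - \hat{\mu}_i}{\sqrt{\widehat{\operatorname{Var}}(y_i)}} \tag{37} $$

Where

$$ \widehat{\operatorname{Var}}(y_i) = \frac{\hat{\mu}_i^2}{\hat{\alpha}} \tag{38} $$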
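
A short sketch of (37) and (38) follows. The function name, the data and the cut-off of 2 used to flag large residuals are assumptions for illustration only.

```python
import math

def standardized_residuals(y, mu_hat, alpha_hat):
    """Residuals (37) with the variance estimate (38): Var(y_i) = mu_i^2 / alpha."""
    return [(yi - mi) / math.sqrt(mi ** 2 / alpha_hat) for yi, mi in zip(y, mu_hat)]

# Hypothetical observations, fitted means and shape estimate.
y_obs = [360.0, 150.0, 700.0]
mu_fit = [300.0, 300.0, 300.0]
res = standardized_residuals(y_obs, mu_fit, alpha_hat=4.0)
flagged = [i for i, r in enumerate(res) if abs(r) > 2.0]
print(res, flagged)
```

Observations with a large absolute standardized residual are candidates for closer inspection as possible outliers.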


4.0  MODEL  SELECTION:  AKAIKE  INFORMATION  CRITERION  (AIC) 


For competing generalized linear regression models, the best model can be determined by taking into account the
number of parameters, using the Akaike Information Criterion (AIC). The AIC describes the trade-off between bias and
variance in model construction, or between the accuracy and the complexity of the model, [11, 41]. It is a measure of
fit that penalizes the number of parameters.

$$ AIC = D + 2p = -2\,l_{mod} + 2p \tag{39} $$

Where

D: deviance statistic

p: number of parameters in the linear predictor of the model under consideration.

$l_{mod}$: log-likelihood of the fitted model.

When models differ in their link functions or predictors, comparing their AIC statistics is straightforward,
provided that the models being compared are fitted to the same data. Smaller values of AIC indicate a better fit,
and thus the AIC can be used to compare models, whether nested or not, [25].
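
The selection rule can be sketched in a few lines. The AIC values in the dictionary are those reported in Table 2 of this study; the function names and the log-likelihood passed to `aic` are hypothetical.

```python
def aic(log_likelihood, n_params):
    """AIC as in (39): -2 * log-likelihood + 2 * number of parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params

def best_by_aic(aic_values):
    """Return the label of the model with the smallest AIC."""
    return min(aic_values, key=aic_values.get)

# AIC values reported in Table 2 of this study.
table2_aic = {
    "Model 1 (log)": 2265.4,
    "Model 2 (identity)": 2265.3,
    "Model 3 (inverse)": 2265.5,
}
print(best_by_aic(table2_aic))
print(aic(-100.0, 3))  # hypothetical log-likelihood and parameter count
```

The three AIC values differ only in the first decimal place, so the ranking they imply is a weak one.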

5.0  RESULTS  OF  GAMMA  REGRESSION  ANALYSIS  OF  AIDS  DATA 

Figure 1 is the scatter plot matrix of the CD4 cell counts against the predictors. It shows clear evidence that the
response is continuous and positively skewed, and that its variance increases rapidly with the mean, especially with
age. This suggests that a gamma regression is an appropriate model structure for the data.

Figure 1: Scatter Plot Matrix of the AIDS Variables

Figure 2: Box Plot for CD4 Counts Data, by Level of Education (categories: no education, primary, secondary, undergraduate, graduate)

Figure 3: Box Plot of CD4 Counts Data by Marital Status

Figure 4: Box Plot of CD4 Counts Data by Age of Patients

Figure 5: Box Plot of CD4 Counts Data by Sex


Table 1: MLE of the Parameters, Standard Errors, and P-Values for
Gamma Regression Models Fitted to the AIDS Data

Effect                  Link       Estimate    SE          P-Value
Intercept               Log        5.7207      0.0398      2e-16***
                        identity   304.7857    12.5586     2e-16***
                        inverse    0.0033      0.00013     2e-16***
Sex                     Log        0.0741      0.0250      0.0034**
                        identity   24.0057     8.0886      0.0034**
                        inverse    -0.0002     0.0007      0.0036**
Age                     Log        -0.0011     0.0015      0.4962
                        identity   -0.3512     0.4932      0.4773
                        inverse    0.000003    0.000005    0.5150
Level of Education
  Primary Education     Log        0.0605      0.0374      0.1072
                        identity   20.0782     12.0473     0.0972
                        inverse    0.00018     0.00012     0.1187
  Secondary Education   Log        0.0225      0.0365      0.5371
                        identity   7.9535      11.5383     0.4914
                        inverse    0.000006    0.00011     0.5827
  Tertiary              Log        0.0145      0.0522      0.7811
                        identity   5.4073      16.3549     0.7413
                        inverse    0.00004     0.00017     0.8187
  Graduate              Log        0.0408      0.0398      0.3071
                        identity   13.3783     12.7832     0.2966
                        inverse    0.00012     0.00011     0.3204
Marital Status
  Married               Log        0.0522      0.0409      0.2028
                        identity   16.9143     13.0257     0.1956
                        inverse    0.00016     0.00013     0.2097
  Divorced              Log        0.0317      0.0497      0.5248
                        identity   10.4324     15.7862     0.5095
                        inverse    0.00009     0.00016     0.5381
  Separated             Log        0.1160      0.1057      0.2741
                        identity   38.2207     37.1342     0.3046
                        inverse    0.00035     0.00030     0.2430

Significance codes: 0 (***) 0.001

Table 2: Residual Deviance and AIC for the Gamma Models

Model                       Residual Deviance   Residual D.F   AIC
Model 1 (log link)          6.1044              197            2265.4
Model 2 (identity link)     6.1014              197            2265.3
Model 3 (inverse link)      6.1075              197            2265.5

The degree of symmetry of the CD4 counts within the levels of each predictor can be judged from the compound box plots
in Figures 2 to 5. Figure 2 shows the box plots of the CD4 counts by level of education of patients; the distributions
are all positively skewed with no outliers, except for the secondary group, which is negatively skewed. Figure 3 is the
box plot of CD4 counts by marital status of AIDS patients. Noticeable from the box plot is that the CD4 counts tend to
be a little higher for single and married patients than for the divorced group. There are obvious outliers in the CD4
counts for married patients, and the distribution for each marital status shows evidence of right skewness; the
separated group is so skewed that its minimum, median and third quartile are all equal. Figure 4 shows the compound box
plot of CD4 counts by age of AIDS patients. The distribution is positively skewed at all ages except 26, 32, 36, 38 and
49, and outliers can be seen at ages 23, 26, 28 and 50. Figure 5, the box plot of CD4 counts by sex, shows that the
distribution of CD4 counts for females is positively skewed with a few outliers, while that for males is negatively
skewed. Males appear to have higher CD4 counts than females across the time period considered.

In the analysis, the response variable is the CD4 count of the AIDS patients, while the level of formal education,
marital status, age and sex of the patients are the predictors. Table 1 gives the results of fitting the three gamma
regression models to the AIDS data: the estimates of the regression parameters, their standard errors and the
corresponding p-values. Model 1 is the log link fit, model 2 is the identity link fit and model 3 is the inverse link
fit. However, none of the three gamma regression models fully captured the nature of the observed distribution of the
CD4 counts. The estimated coefficients of the three models are noticeably different from each other, since each link
places the coefficients on a different scale. The three models can be evaluated based on their residual deviances and
on AIC-based model selection. In fact the three models yield similar results, with little difference in their residual
deviances and their AIC values. As shown in Table 2, model 1 has a residual deviance of 6.1044 on 197 d.f. and an AIC
of 2265.4, model 2 has a residual deviance of 6.1014 on 197 d.f. and an AIC of 2265.3, while model 3 has a residual
deviance of 6.1075 on 197 d.f. and an AIC of 2265.5. The model with the smallest AIC is preferred, so that model 2,
with the identity link, is preferred over the log and inverse link fits. Among all the predictors considered, the three
gamma models consistently indicate that only sex has a significant effect on the CD4 counts of the AIDS patients. For
model 1, the mean CD4 count for males is exp(0.0741) = 1.0769 times that for females, with the other variables held
fixed. The comparable figure for sex in the identity link gamma model is an additive difference of 24.01 cells/mm³.
Age, level of education and marital status are not significant in predicting the CD4 counts of AIDS patients.
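
The difference between the multiplicative and additive interpretations can be illustrated with the intercept, sex and age coefficients of Table 1. The sketch below makes several assumptions that are not stated in the study: sex is coded 1 for males and 0 for females, the patient is in the reference education and marital-status categories, and the age of 35 is chosen arbitrarily.

```python
import math

# Coefficients taken from Table 1 (intercept, sex, age) for each link.
COEF = {
    "log": {"intercept": 5.7207, "sex": 0.0741, "age": -0.0011},
    "identity": {"intercept": 304.7857, "sex": 24.0057, "age": -0.3512},
    "inverse": {"intercept": 0.0033, "sex": -0.0002, "age": 0.000003},
}
INVERSE_LINK = {
    "log": math.exp,
    "identity": lambda eta: eta,
    "inverse": lambda eta: 1.0 / eta,
}

def fitted_mean(link, sex, age):
    """Fitted mean CD4 count for a patient in the reference education and
    marital-status categories, whose coefficients then drop out."""
    c = COEF[link]
    eta = c["intercept"] + c["sex"] * sex + c["age"] * age
    return INVERSE_LINK[link](eta)

AGE = 35  # hypothetical age chosen for illustration
for link in COEF:
    female, male = fitted_mean(link, 0, AGE), fitted_mean(link, 1, AGE)
    print(f"{link:8s} female {female:7.2f}  male {male:7.2f}  "
          f"ratio {male / female:.4f}  difference {male - female:6.2f}")
```

Under these assumptions the log link gives a constant male-to-female ratio, the identity link gives a constant difference, and all three links give fitted means of similar size, which is consistent with the small differences in deviance and AIC.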

The usual diagnostics were performed for the three models, as shown in Figures 6, 7 and 8 respectively. The plot of
the jackknife deviance residuals against the fitted values, in the top left panel of each figure, reveals no
systematic trend. The plot in the top right panel is the normal QQ plot of the standardized deviance residuals, and
shows that the standardized residuals are approximately normally distributed. In the bottom left panel is the plot of
the Cook statistics against the standardized leverages, which identifies three observations to the right of the
vertical line as observations with likely high leverage compared to the variance of the raw residual at that point.
The bottom right panel shows the Cook statistics plotted against case number, which also reveals that some
observations are influential on the models.

Figure 6: Residual Plot for Model 1

Figure 7: Residual Plot for Model 2

Figure 8: Residual Plot for Model 3


6.0  DISCUSSIONS 

Generally, GLMs provide regression modelling for non-normal data. Gamma regression is a reasonable model for
the AIDS data since the response variable, the CD4 cell count, is a continuous, positive outcome with a constant
coefficient of variation. Three gamma models were considered for the AIDS data. Before the gamma models were fitted to
the AIDS data, outliers were detected. Regardless of the distribution of CD4 counts, the parameter estimates of the
model with the inverse link were very poor. This link was deemed a poor choice for the data, although it may perform
better on a different data set. Comparison of the fit indices of the three models revealed that the gamma regression
model with the identity link provided a better fit than the models with the log link and the inverse link.

The gamma regression model is not without limitations. Generally, GLMs tend to be inefficient in the
presence of heavy tails. The gamma model is not consistent when the data are heteroscedastic in nature. Specification
of an adequate gamma regression model for a data set requires a careful examination of the characteristics of the data.
Further studies of AIDS patients with larger numbers of cases and more predictive variables are therefore recommended
to confirm the results. Despite the limitations, the results are valuable in understanding how CD4 cell counts respond
to the risk factors.

CONCLUSIONS

The results of the gamma regression analysis of AIDS patients have the potential to be useful to medical doctors
and health workers attempting to determine factors that could help in improving the CD4 counts of AIDS patients,
and thereby achieve improved health of the patients. The results of this study should be valuable to policy makers in
evaluating the cost effectiveness of factors that could impact positively on the health of AIDS patients.